- name: d8.node-group-update
  rules:
  - alert: D8NodeIsNotUpdating
    expr: |
      max by (node,node_group) (
        node_group_node_status{status="ToBeUpdated"} *
        on(node) group_left() (max by(node) ((kube_node_status_condition{condition="Ready", status="true"} == 1)))
      ) > 0
    for: 5m
    labels:
      tier: cluster
      severity_level: "9"
    annotations:
      plk_markup_format: markdown
      plk_protocol_version: "1"
      plk_create_group_if_not_exists__d8_cluster_has_problems_with_nodes_updates: "D8ClusterHasProblemsWithNodesUpdates,tier=cluster,prometheus=deckhouse,kubernetes=~kubernetes"
      plk_grouped_by__d8_cluster_has_problems_with_nodes_updates: "D8ClusterHasProblemsWithNodesUpdates,tier=cluster,prometheus=deckhouse,kubernetes=~kubernetes"
      summary: The {{ $labels.node }} Node does not update.
      description: |
        There is a new update for the {{ $labels.node }} Node of the {{ $labels.node_group }} group but it has not received the update nor trying to.

        Most likely Bashible for some reason is not handling the update correctly. At this point, it must add the `update.node.deckhouse.io/waiting-for-approval` annotation to the Node so that it can be approved.

        You can find out the most current version of the update using this command:
        ```shell
        kubectl -n d8-cloud-instance-manager get secret configuration-checksums -o jsonpath={.data.{{ $labels.node_group }}} | base64 -d
        ```

        Use the following command to find out the version on the Node:
        ```shell
        kubectl get node {{ $labels.node }} -o jsonpath='{.metadata.annotations.node\.deckhouse\.io/configuration-checksum}'
        ```

        Here is how you can view Bashible logs on the Node:
        ```shell
        journalctl -fu bashible
        ```

  - alert: D8NodeIsNotUpdating
    expr: |
      max by (node,node_group) (
        node_group_node_status{status="Approved"} *
        on(node) group_left() (max by(node) ((kube_node_status_condition{condition="Ready", status="true"} == 1)))
      )> 0
    for: 10m
    labels:
      tier: cluster
      severity_level: "8"
    annotations:
      plk_markup_format: markdown
      plk_protocol_version: "1"
      plk_create_group_if_not_exists__d8_cluster_has_problems_with_nodes_updates: "D8ClusterHasProblemsWithNodesUpdates,tier=cluster,prometheus=deckhouse,kubernetes=~kubernetes"
      plk_grouped_by__d8_cluster_has_problems_with_nodes_updates: "D8ClusterHasProblemsWithNodesUpdates,tier=cluster,prometheus=deckhouse,kubernetes=~kubernetes"
      summary: The {{ $labels.node }} Node cannot complete the update.
      description: |
        There is a new update for the {{ $labels.node }} Node of the {{ $labels.node_group }} group}; the Node has learned about the update, requested and received approval, but cannot complete the update.

        Here is how you can view Bashible logs on the Node:
        ```shell
        journalctl -fu bashible
        ```

  - alert: D8NodeIsNotUpdating
    expr: |
      max by (node,node_group) (
        node_group_node_status{status="DisruptionApproved"} *
        on(node) group_left() (max by(node) ((kube_node_status_condition == 1)))
      )> 0
    for: 20m
    labels:
      tier: cluster
      severity_level: "7"
    annotations:
      plk_markup_format: markdown
      plk_protocol_version: "1"
      plk_create_group_if_not_exists__d8_cluster_has_problems_with_nodes_updates: "D8ClusterHasProblemsWithNodesUpdates,tier=cluster,prometheus=deckhouse,kubernetes=~kubernetes"
      plk_grouped_by__d8_cluster_has_problems_with_nodes_updates: "D8ClusterHasProblemsWithNodesUpdates,tier=cluster,prometheus=deckhouse,kubernetes=~kubernetes"
      plk_cause_of__node_unschedulable: "NodeUnschedulable,tier=cluster,prometheus=deckhouse,node={{ $labels.node }}"
      summary: The {{ $labels.node }} Node cannot complete the update.
      description: |
        There is a new update for the {{ $labels.node }} Node of the {{ $labels.node_group }} group; the Node has learned about the update, requested and received approval, started the update, ran into a step that causes possible downtime. The update manager (the update_approval hook of node-group module) performed the update, and the Node received downtime approval. However, there is no success message about the update.

        Here is how you can view Bashible logs on the Node:
        ```shell
        journalctl -fu bashible
        ```

  - alert: D8NodeUpdateStuckWaitingForDisruptionApproval
    expr: |
      max by (node,node_group) (
        node_group_node_status{status="WaitingForDisruptionApproval"} *
        on(node) group_left() (max by(node) (kube_node_status_condition == 1))
      )> 0
    for: 5m
    labels:
      tier: cluster
      severity_level: "8"
    annotations:
      plk_markup_format: markdown
      plk_protocol_version: "1"
      plk_create_group_if_not_exists__d8_cluster_has_problems_with_nodes_updates: "D8ClusterHasProblemsWithNodesUpdates,tier=cluster,prometheus=deckhouse,kubernetes=~kubernetes"
      plk_grouped_by__d8_cluster_has_problems_with_nodes_updates: "D8ClusterHasProblemsWithNodesUpdates,tier=cluster,prometheus=deckhouse,kubernetes=~kubernetes"
      summary: The {{ $labels.node }} Node cannot get disruption approval.
      description: |
        There is a new update for the {{ $labels.node }} Node of the {{ $labels.node_group }} group; the Node has learned about the update, requested and received approval, started the update, and ran into a stage that causes possible downtime. For some reason, the Node cannot get that approval (it is issued fully automatically by the `update_approval` hook of the `node-manager`).

  - alert: D8NodeGroupIsNotUpdating
    expr: |
      count by (node_group) (
        node_group_node_status{status="WaitingForApproval"} *
        on(node) group_left() (max by(node) ((kube_node_status_condition == 1)))
      ) > 0 and (
        count by (node_group) (
          node_group_node_status{status="Approved"} *
          on(node) group_left() (max by(node) ((kube_node_status_condition == 1)))
        ) == 0
      )
    for: 5m
    labels:
      tier: cluster
      severity_level: "8"
    annotations:
      plk_markup_format: markdown
      plk_protocol_version: "1"
      plk_create_group_if_not_exists__d8_cluster_has_problems_with_nodes_updates: "D8ClusterHasProblemsWithNodesUpdates,tier=cluster,prometheus=deckhouse,kubernetes=~kubernetes"
      plk_grouped_by__d8_cluster_has_problems_with_nodes_updates: "D8ClusterHasProblemsWithNodesUpdates,tier=cluster,prometheus=deckhouse,kubernetes=~kubernetes"
      summary: The {{ $labels.node_group }} node group is not handling the update correctly.
      description: |
        There is a new update for Nodes of the {{ $labels.node_group }} group; Nodes have learned about the update. However, no Node can get approval to start updating.

        Most likely, there is a problem with the `update_approval` hook of the `node-manager` module.


  - alert: D8ProblematicNodeGroupConfiguration
    expr: |
      max by (status, node_group, node) (node_group_node_status{status="UpdateFailedNoConfigChecksum"}) == 1
    for: 5m
    labels:
      tier: cluster
      severity_level: "8"
    annotations:
      plk_markup_format: markdown
      plk_protocol_version: "1"
      plk_create_group_if_not_exists__d8_cluster_has_problems_with_nodes_updates: "D8ClusterHasProblemsWithNodesUpdates,tier=cluster,prometheus=deckhouse,kubernetes=~kubernetes"
      plk_grouped_by__d8_cluster_has_problems_with_nodes_updates: "D8ClusterHasProblemsWithNodesUpdates,tier=cluster,prometheus=deckhouse,kubernetes=~kubernetes"
      summary: The {{ $labels.node }} Node cannot begin the update.
      description: |
        There is a new update for Nodes of the {{ $labels.node_group }} group; Nodes have learned about the update. However, {{ $labels.node }} Node cannot be updated.

        Node {{ $labels.node }} has no `node.deckhouse.io/configuration-checksum` annotation.
        Perhaps the bootstrap process of the Node did not complete correctly. Check the `cloud-init` logs (/var/log/cloud-init-output.log) of the Node.
        There is probably a problematic NodeGroupConfiguration resource for {{ $labels.node_group }} NodeGroup.

  - alert: NodeRequiresDisruptionApprovalForUpdate
    expr: |
      max by (node,node_group) (
        node_group_node_status{status="WaitingForManualDisruptionApproval"} *
        on(node) group_left() (max by(node) ((kube_node_status_condition == 1)))
      )> 0
    labels:
      tier: cluster
      severity_level: "8"
    annotations:
      plk_markup_format: markdown
      plk_protocol_version: "1"
      plk_create_group_if_not_exists__cluster_has_nodes_requiring_disruption_approval_for_update: "ClusterHasNodesRequiringDisruptionApprovalForUpdate,tier=cluster,prometheus=deckhouse,kubernetes=~kubernetes"
      plk_grouped_by__cluster_has_nodes_requiring_disruption_approval_for_update: "ClusterHasNodesRequiringDisruptionApprovalForUpdate,tier=cluster,prometheus=deckhouse,kubernetes=~kubernetes"
      summary: The {{ $labels.node }} Node requires disruption approval to proceed with the update
      description: |
        There is a new update for Nodes and the {{ $labels.node }} Node of the {{ $labels.node_group }} group has learned about the update, requested and received approval, started the update, and ran into a step that causes possible downtime.

        You have to manually approve the disruption since the `Manual` mode is active in the group settings (`disruptions.approvalMode`).

        Grant approval to the Node using the `update.node.deckhouse.io/disruption-approved=` annotation if it is ready for unsafe updates (e.g., drained).

        **Caution!!!** The Node will not be drained automatically since the manual mode is enabled (`disruptions.approvalMode: Manual`).

        **Caution!!!** No need to drain the master node.

        * Use the following commands to drain the Node and grant it update approval:
          ```shell
          kubectl drain {{ $labels.node }} --delete-local-data=true --ignore-daemonsets=true --force=true &&
            kubectl annotate node {{ $labels.node }} update.node.deckhouse.io/disruption-approved=
          ```
        * Note that you need to **uncordon the node** after the update is complete (i.e., after removing the `update.node.deckhouse.io/approved` annotation).
          ```
          while kubectl get node {{ $labels.node }} -o json | jq -e '.metadata.annotations | has("update.node.deckhouse.io/approved")' > /dev/null; do sleep 1; done
          kubectl uncordon {{ $labels.node }}
          ```

        Note that if there are several Nodes in a NodeGroup, you will need to repeat this operation for each Node since only one Node can be updated at a time. Perhaps it makes sense to temporarily enable the automatic disruption approval mode (`disruptions.approvalMode: Automatic`).

  - alert: NodeStuckInDrainingForDisruptionDuringUpdate
    expr: |
      max by (node,node_group) (
        node_group_node_status{status="DrainingForDisruption"} *
        on(node) group_left() (max by(node) ((kube_node_status_condition == 1)))
      )> 0
    for: 2h
    labels:
      tier: cluster
      severity_level: "6"
    annotations:
      plk_markup_format: markdown
      plk_protocol_version: "1"
      plk_create_group_if_not_exists__cluster_has_nodes_stuck_in_draining_for_disruption_during_update: "ClusterHasNodesStuckInDrainingForDisruptionDuringUpdate,tier=cluster,prometheus=deckhouse,kubernetes=~kubernetes"
      plk_grouped_by__cluster_has_nodes_stuck_in_draining_for_disruption_during_update: "ClusterHasNodesStuckInDrainingForDisruptionDuringUpdate,tier=cluster,prometheus=deckhouse,kubernetes=~kubernetes"
      summary: The {{ $labels.node }} Node is stuck in draining.
      description: |
        There is a new update for the {{ $labels.node }} Node of the {{ $labels.node_group }} NodeGroup. The Node has learned about the update, requested and received approval, started the update, ran into a step that causes possible downtime, and stuck in draining in order to get disruption approval automatically.

        You can get more info by running: `kubectl -n default get event --field-selector involvedObject.name={{ $labels.node }},reason=ScaleDown --sort-by='.metadata.creationTimestamp'`

  - alert: NodeStuckInDraining
    expr: |
      max by (message, node, node_group) (
        d8_node_draining *
        on (node) group_left (node_group) node_group_node_status{status="Draining"} *
        on (node) group_left () (max by (node) ((kube_node_status_condition == 1)))
      ) > 0
    for: 5m
    labels:
      tier: cluster
      severity_level: "6"
    annotations:
      plk_markup_format: markdown
      plk_protocol_version: "1"
      plk_create_group_if_not_exists__cluster_has_nodes_stuck_in_draining: "ClusterHasNodesStuckInDraining,tier=cluster,prometheus=deckhouse,kubernetes=~kubernetes"
      plk_grouped_by__cluster_has_nodes_stuck_in_draining: "ClusterHasNodesStuckInDraining,tier=cluster,prometheus=deckhouse,kubernetes=~kubernetes"
      plk_labels_as_annotations: message
      summary: The {{ $labels.node }} Node is stuck in draining.
      description: |
        The {{ $labels.node }} Node of the {{ $labels.node_group }} NodeGroup stuck in draining.

        You can get more info by running: `kubectl -n default get event --field-selector involvedObject.name={{ $labels.node }},reason=DrainFailed --sort-by='.metadata.creationTimestamp'`

        The error message is: {{ $labels.message }}

  - alert: D8BashibleApiserverLocked
    expr: d8_bashible_apiserver_locked == 1
    for: 15m
    labels:
      tier: cluster
      severity_level: "6"
    annotations:
      plk_markup_format: markdown
      plk_protocol_version: "1"
      summary: Bashible-apiserver is locked for too long
      description: |
        Check bashible-apiserver pods are up-to-date and running `kubectl -n d8-cloud-instance-manager get pods -l app=bashible-apiserver`
